{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PDF Extraction and Ingest with ELSER Example\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/ingestion-and-chunking/pdf-chunking-ingest.ipynb)\n",
    "\n",
    "This workbook demonstrates how to extract the contents of a single PDF, create passages and ingest into Elasticsearch. \n",
    "\n",
    "In this example we will:\n",
    "- load the PDF using pypdf\n",
    "- chunk the text with LangChain document splitter\n",
    "- ingest into Elasticsearch with LangChain Elasticsearch Vectorstore. \n",
    "\n",
    "We will also setup your Elasticsearch cluster with ELSER model, so we can use it to embed the passages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "zQlYpYkI46Ff",
    "outputId": "83677846-8a6a-4b49-fde0-16d473778814"
   },
   "outputs": [],
   "source": [
    "!pip install -qU  pypdf langchain_community langchain \"elasticsearch<9\" tiktoken langchain-elasticsearch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GCZR7-zK810e"
   },
   "source": [
    "## Connecting to Elasticsearch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "id": "DofNZ2w25nIr"
   },
   "outputs": [],
   "source": [
    "from elasticsearch import Elasticsearch\n",
    "from getpass import getpass\n",
    "\n",
    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
    "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
    "\n",
    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
    "ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")\n",
    "\n",
    "client = Elasticsearch(\n",
    "    # For local development\n",
    "    # \"http://localhost:9200\",\n",
    "    # basic_auth=(\"elastic\", \"changeme\")\n",
    "    cloud_id=ELASTIC_CLOUD_ID,\n",
    "    api_key=ELASTIC_API_KEY,\n",
    ")"
   ]
  },
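  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before moving on, it can help to confirm that the client actually reaches the cluster. The cell below is a minimal sketch that assumes the `client` created above; `client.info()` returns basic cluster metadata, including the cluster name and Elasticsearch version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: fails fast if the Cloud ID or API key is wrong\n",
    "info = client.info()\n",
    "print(f\"Connected to cluster '{info['cluster_name']}' running Elasticsearch {info['version']['number']}\")"
   ]
  },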
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zv6hKYWr8-Mg"
   },
   "source": [
    "## Deploying ELSER"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "1U4ffD2K9BkJ"
   },
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "model = \".elser_model_2\"\n",
    "\n",
    "try:\n",
    "    client.ml.put_trained_model(model_id=model, input={\"field_names\": [\"text_field\"]})\n",
    "except:\n",
    "    pass\n",
    "\n",
    "while True:\n",
    "    status = client.ml.get_trained_models(model_id=model, include=\"definition_status\")\n",
    "\n",
    "    if status[\"trained_model_configs\"][0][\"fully_defined\"]:\n",
    "        print(model + \" is downloaded and ready to be deployed.\")\n",
    "        break\n",
    "    else:\n",
    "        print(model + \" is downloading or not ready to be deployed.\")\n",
    "    time.sleep(5)\n",
    "\n",
    "client.ml.start_trained_model_deployment(\n",
    "    model_id=model, number_of_allocations=1, wait_for=\"starting\"\n",
    ")\n",
    "\n",
    "while True:\n",
    "    status = client.ml.get_trained_models_stats(\n",
    "        model_id=model,\n",
    "    )\n",
    "    if status[\"trained_model_stats\"][0][\"deployment_stats\"][\"state\"] == \"started\":\n",
    "        print(model + \" has been successfully deployed.\")\n",
    "        break\n",
    "    else:\n",
    "        print(model + \" is currently being deployed.\")\n",
    "    time.sleep(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wqYXqJxn9JsA"
   },
   "source": [
    "## Importing PDF chunks into Index\n",
    "This will load the PDF from the url provided, and then chunk the text into passage docs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "id": "7bN32vunqIk2"
   },
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import PyPDFLoader\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "# Change to any PDF of your choice\n",
    "loader = PyPDFLoader(\"https://arxiv.org/pdf/2103.15348.pdf\")\n",
    "\n",
    "data = loader.load()\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n",
    "    chunk_size=512, chunk_overlap=256\n",
    ")\n",
    "docs = loader.load_and_split(text_splitter=text_splitter)"
   ]
  },
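  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the splitter produced, the cell below is a small sketch that counts the chunks and measures their length in tokens. It assumes the `docs` list from the previous cell, and it assumes the splitter used tiktoken's `gpt2` encoding, which is the default when no encoding is named. These counts come from that tokenizer, so they are only a rough guide to how ELSER's own tokenizer will see each passage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tiktoken\n",
    "\n",
    "# Assumption: matches the splitter's default encoding (\"gpt2\")\n",
    "encoding = tiktoken.get_encoding(\"gpt2\")\n",
    "token_counts = [len(encoding.encode(doc.page_content)) for doc in docs]\n",
    "\n",
    "print(f\"{len(docs)} chunks; longest is {max(token_counts)} tokens\")\n",
    "print(docs[0].metadata)  # metadata attached by the loader, such as source and page"
   ]
  },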
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Ingesting the passages into Elasticsearch\n",
    "This will ingest the passage docs into the Elasticsearch index, under the specified INDEX_NAME."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "0xtdeIJI9N9-"
   },
   "outputs": [],
   "source": [
    "from langchain_elasticsearch import ElasticsearchStore\n",
    "\n",
    "INDEX_NAME = \"pdf_chunked_index\"\n",
    "\n",
    "ElasticsearchStore.from_documents(\n",
    "    docs,\n",
    "    es_connection=client,\n",
    "    index_name=INDEX_NAME,\n",
    "    strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=model),\n",
    "    bulk_kwargs={\n",
    "        \"request_timeout\": 60,\n",
    "    },\n",
    ")"
   ]
  }
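  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional check that the ingest worked, the cell below sketches one way to compare the number of indexed documents with the number of passages. It assumes the `client`, `INDEX_NAME`, and `docs` names from the cells above. If the index already held documents from an earlier run, the count will be higher than the number of passages created here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Refresh so the newly indexed passages are visible to count and search requests\n",
    "client.indices.refresh(index=INDEX_NAME)\n",
    "\n",
    "indexed = client.count(index=INDEX_NAME)[\"count\"]\n",
    "print(f\"{indexed} documents in '{INDEX_NAME}' ({len(docs)} passages created in this run)\")"
   ]
  }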
 ],
 "metadata": {
  "colab": {
   "include_colab_link": true,
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
